RVF Modeling Pipeline
Overview
This is the main pipeline script, built with the R targets package, for building predictive models of Rift Valley fever (RVF) outbreaks in South Africa. It downloads processed environmental and outbreak data, trains machine learning models, and generates performance reports.
Pipeline Components
1. Model Data Targets
Downloads and prepares the RVF model dataset produced by the RVF Data Processing Pipeline:
- Loads pre-processed data from the open-rvfcast S3 bucket
- Creates the base dataset for all subsequent modeling steps
2. Cross-Validation Targets
Nested Cross-Validation Approach
The RVF modeling pipeline uses nested cross-validation to test how well models generalize across both time and space:
Outer Loop (Expanding Window)
- Uses 4 temporal folds with an expanding training window
- Training data grows progressively from Fold 1 to Fold 4
- Each fold is tested on a time period later than its training window (shown in green in the accompanying diagram)
- A separate holdout dataset (2018 onwards) is reserved for final validation
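A minimal sketch of the expanding-window split described above, written in Python purely for illustration (the pipeline itself is R-based). The function name `expanding_window_folds`, the year range, and the equal-sized year blocks are assumptions made for this example; the actual fold boundaries used by the pipeline are not stated here.

```python
from typing import Dict, List


def expanding_window_folds(
    years: List[int], n_folds: int = 4, holdout_start: int = 2018
) -> List[Dict[str, List[int]]]:
    """Build expanding-window folds from the years before the holdout period."""
    # Years from holdout_start onwards are never used for training or tuning.
    usable = sorted(y for y in set(years) if y < holdout_start)

    # Split usable years into n_folds + 1 contiguous blocks:
    # block 0 seeds the first training window, blocks 1..n_folds are test periods.
    n_blocks = n_folds + 1
    if len(usable) < n_blocks:
        raise ValueError("not enough distinct years for the requested folds")
    size, extra = divmod(len(usable), n_blocks)
    blocks, start = [], 0
    for i in range(n_blocks):
        end = start + size + (1 if i < extra else 0)
        blocks.append(usable[start:end])
        start = end

    # Fold k trains on every block before block k and tests on block k,
    # so the training window grows from one fold to the next.
    folds = []
    for k in range(1, n_blocks):
        train = [y for block in blocks[:k] for y in block]
        folds.append({"train": train, "test": blocks[k]})
    return folds


# Hypothetical year range, chosen only to make the example concrete.
years = list(range(2008, 2021))
folds = expanding_window_folds(years)
for i, fold in enumerate(folds, start=1):
    print(
        f"Fold {i}: train {fold['train'][0]}-{fold['train'][-1]}, "
        f"test {fold['test'][0]}-{fold['test'][-1]}"
    )
```

Because every training window ends before its test period begins, no fold is evaluated on data older than what it was trained on.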
Inner Loop (Leave-One-Location-Out)
- Within each outer fold, performs spatial cross-validation
- Uses 5 location-based folds (POP 1-5 representing different populations/regions)
- Each inner fold trains on 4 locations and tests on the 5th
- Ensures the model can generalize to new geographic areas
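The leave-one-location-out step can be sketched as follows, again in Python for illustration only. The helper `leave_one_location_out` and the sample row labels are hypothetical; only the "POP 1" to "POP 5" naming comes from the description above.

```python
from typing import Iterator, List, Tuple


def leave_one_location_out(
    locations: List[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_location, train_indices, test_indices) per location."""
    for held_out in sorted(set(locations)):
        # Rows from the held-out location form the test set;
        # rows from all other locations form the training set.
        train_idx = [i for i, loc in enumerate(locations) if loc != held_out]
        test_idx = [i for i, loc in enumerate(locations) if loc == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical location label for each row of a training window.
row_locations = ["POP 1", "POP 2", "POP 3", "POP 4", "POP 5", "POP 1", "POP 3"]
inner_folds = list(leave_one_location_out(row_locations))
for held_out, train_idx, test_idx in inner_folds:
    print(f"hold out {held_out}: train rows {train_idx}, test rows {test_idx}")
```

In a nested setup, this generator would be applied to the training rows of each outer fold, so that tuning never sees the outer test period.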
This nested approach provides robust estimates of model performance by evaluating:
- Temporal generalization through the expanding window
- Spatial generalization through leave-one-location-out validation
- Combined spatio-temporal performance across multiple scenarios
3. Model Tuning Targets
(To be implemented)
Will handle hyperparameter optimization and model selection:
- Data Splitting: Separates data into training (pre-2018) and holdout (2018+) sets
- XGBoost Components:
  - Model specification with custom constraints
  - Hyperparameter grid for tuning
  - Recipe for feature engineering
  - Metrics for model evaluation
- Advanced Features (commented):
  - Interaction constraints to control feature relationships
  - Monotonic constraints (e.g., ensuring positive area effect)
  - Base score initialization based on outbreak prevalence
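As a rough illustration of the data split and the constraint settings listed above, the Python sketch below separates training from holdout rows, derives a base score from outbreak prevalence, and assembles a parameter dictionary. The feature names (`area`, `rainfall`, `ndvi`), the sample rows, and the helper functions are all hypothetical. `objective`, `base_score`, `monotone_constraints`, and `interaction_constraints` are XGBoost parameter names, but the values shown are assumptions, not the pipeline's actual configuration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]


def split_by_year(rows: List[Row], holdout_start: int = 2018) -> Tuple[List[Row], List[Row]]:
    """Separate rows into training (before holdout_start) and holdout sets."""
    train = [r for r in rows if r["year"] < holdout_start]
    holdout = [r for r in rows if r["year"] >= holdout_start]
    return train, holdout


def outbreak_prevalence(rows: List[Row]) -> float:
    """Fraction of rows with an outbreak (outbreak == 1)."""
    if not rows:
        raise ValueError("cannot compute prevalence of an empty dataset")
    return sum(r["outbreak"] for r in rows) / len(rows)


def build_xgb_params(features: List[str], prevalence: float) -> dict:
    """Assemble an illustrative XGBoost parameter dictionary."""
    area_idx = features.index("area")
    others = [i for i in range(len(features)) if i != area_idx]
    # +1 asks for a non-decreasing effect of area; 0 leaves a feature unconstrained.
    monotone = ["1" if i == area_idx else "0" for i in range(len(features))]
    return {
        "objective": "binary:logistic",
        # Start predictions at the observed outbreak rate (a probability).
        "base_score": prevalence,
        "monotone_constraints": "(" + ",".join(monotone) + ")",
        # Index groups: area sits alone, the remaining features may interact.
        "interaction_constraints": [[area_idx], others],
    }


# Hypothetical rows, only to make the example concrete.
rows = [
    {"year": 2015, "outbreak": 0}, {"year": 2016, "outbreak": 1},
    {"year": 2016, "outbreak": 0}, {"year": 2017, "outbreak": 0},
    {"year": 2018, "outbreak": 1}, {"year": 2019, "outbreak": 0},
]
train, holdout = split_by_year(rows)
params = build_xgb_params(["area", "rainfall", "ndvi"], outbreak_prevalence(train))
print(params)
```

Computing the prevalence from the training rows only keeps the holdout period out of every modeling decision.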
4. Model Fitting Targets
(To be implemented)
Will include:
- Final model training on full training set
- Model persistence and serialization
- Feature importance extraction
- Prediction generation
5. Model Evaluation Targets
(To be implemented)
Planned evaluations:
- ROC curves comparing different model specifications
- Performance metrics across temporal and spatial dimensions
- Calibration plots
- Feature importance analysis
- Cross-validation performance summaries
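To make the planned ROC comparison concrete, here is a small Python sketch that computes the area under the ROC curve for two sets of predictions on the same labels. The `roc_auc` helper, the labels, and the scores are invented for illustration; the pipeline's eventual metric set is not specified above.

```python
from typing import List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    # Count positive/negative pairs where the positive is ranked higher;
    # ties contribute half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical outbreak labels and predicted probabilities from two model specifications.
labels = [0, 0, 1, 1]
scores_spec_a = [0.1, 0.4, 0.35, 0.8]
scores_spec_b = [0.1, 0.2, 0.7, 0.9]

auc_a = roc_auc(labels, scores_spec_a)
auc_b = roc_auc(labels, scores_spec_b)
print(f"spec A AUC = {auc_a:.2f}, spec B AUC = {auc_b:.2f}")
```

Applying the same function to each temporal or spatial fold would give the per-fold summaries listed above.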
Model Interpretability
(To be implemented)
- DALEX integration for model explanations
- Ceteris paribus plots for understanding feature effects
- Validation of constraint effectiveness (e.g., checking if area effect remains constant)
- Feature interaction analysis
- Partial dependence plots
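The idea behind a ceteris paribus profile, and how it could be used to check a monotonic constraint, can be sketched in a few lines of Python. This does not use DALEX; `toy_model` is a hypothetical stand-in for a fitted model, and the feature names and grid values are assumptions made for the example.

```python
import math
from typing import Callable, Dict, List


def ceteris_paribus_profile(
    predict: Callable[[Dict[str, float]], float],
    observation: Dict[str, float],
    feature: str,
    grid: List[float],
) -> List[float]:
    """Predictions for one observation as a single feature varies over grid."""
    profile = []
    for value in grid:
        modified = dict(observation)  # copy so all other features stay fixed
        modified[feature] = value
        profile.append(predict(modified))
    return profile


def is_non_decreasing(values: List[float], tol: float = 1e-12) -> bool:
    """True if each value is at least as large as the previous one."""
    return all(b >= a - tol for a, b in zip(values, values[1:]))


def toy_model(x: Dict[str, float]) -> float:
    """Hypothetical stand-in for a fitted model's predicted outbreak probability."""
    z = -3.0 + 0.002 * x["area"] + 0.01 * x["rainfall"]
    return 1.0 / (1.0 + math.exp(-z))


observation = {"area": 500.0, "rainfall": 40.0}
area_grid = [0.0, 250.0, 500.0, 750.0, 1000.0]
area_profile = ceteris_paribus_profile(toy_model, observation, "area", area_grid)
print("area effect is non-decreasing:", is_non_decreasing(area_profile))
```

Repeating the check over many observations gives a quick sanity test that a positive monotonic constraint on a feature is actually reflected in the model's predictions.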
6. Report Targets
(To be implemented)
Will generate:
- Model performance summaries
- Visualization of results
- Comparison across different model configurations
- Documentation of model choices and rationale
7. Documentation Targets
(To be implemented)
Automatically generates project documentation:
- Renders the README.Rmd file
- Creates up-to-date documentation reflecting current pipeline state
- Ensures reproducibility through comprehensive documentation
Key Features
- Modular Design: Clear separation of concerns across different pipeline stages
- Constraint-Based Modeling: Incorporates domain knowledge through XGBoost constraints
- Comprehensive Validation: Multiple levels of cross-validation for robust performance estimates
- Interpretability Focus: Built-in support for understanding model behavior
- Reproducibility: Full documentation and dependency management